{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "99cea58c-48bc-4af6-8358-df9695659983",
   "metadata": {},
   "source": [
    "# Benchmarking OpenAI Retrieval API (through Assistant Agent)\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/agent/openai_retrieval_benchmark.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "This guide benchmarks the Retrieval Tool from the [OpenAI Assistant API](https://platform.openai.com/docs/assistants/overview), by using our `OpenAIAssistantAgent`. We run over the Llama 2 paper, and compare generation quality against a naive RAG pipeline.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c61c873d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24fce591-8b3a-4c5f-985e-2669a05595bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0906a181-30ea-4f04-8307-215b988ea89b",
   "metadata": {},
   "source": [
    "## Setup Data\n",
    "\n",
    "Here we load the Llama 2 paper and chunk it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae2b6c4c-12db-4e82-be83-76507f4cb938",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2023-11-08 21:53:52--  https://arxiv.org/pdf/2307.09288.pdf\n",
      "Resolving arxiv.org (arxiv.org)... 128.84.21.199\n",
      "Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 13661300 (13M) [application/pdf]\n",
      "Saving to: ‘data/llama2.pdf’\n",
      "\n",
      "data/llama2.pdf     100%[===================>]  13.03M   141KB/s    in 1m 48s  \n",
      "\n",
      "2023-11-08 21:55:42 (123 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/'\n",
    "!wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ce1f3a3-740c-43a2-a755-94f13a2c9762",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from llama_index import Document, ServiceContext, VectorStoreIndex\n",
    "from llama_hub.file.pymu_pdf.base import PyMuPDFReader\n",
    "from llama_index.node_parser import SimpleNodeParser\n",
    "from llama_index.llms import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0fd2ddf-b13f-498d-ab80-d8d456ca955d",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = PyMuPDFReader()\n",
    "docs0 = loader.load(file_path=Path(\"./data/llama2.pdf\"))\n",
    "\n",
    "doc_text = \"\\n\\n\".join([d.get_content() for d in docs0])\n",
    "docs = [Document(text=doc_text)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65105bdd-2d4f-4684-b7f1-a24b677b4df6",
   "metadata": {},
   "outputs": [],
   "source": [
    "node_parser = SimpleNodeParser.from_defaults()\n",
    "nodes = node_parser.get_nodes_from_documents(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be838a39-2c7e-43c6-b684-e699595f63a8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "89"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55fa821a-d459-44aa-ab38-5c74184b9cd9",
   "metadata": {},
   "source": [
    "## Define Eval Modules\n",
    "\n",
    "We setup evaluation modules, including the dataset and evaluators."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49f83fd3-57d6-4bf2-bb28-5998e96b0e43",
   "metadata": {},
   "source": [
    "### Setup \"Golden Dataset\"\n",
    "\n",
    "Here we load in a \"golden\" dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55ca5fba-4824-464a-b25e-8d470755e692",
   "metadata": {},
   "source": [
    "#### Option 1: Pull Existing Dataset\n",
    "\n",
    "**NOTE**: We pull this in from Dropbox. For details on how to generate a dataset please see our `DatasetGenerator` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9dee93c1-9e1b-4957-83e5-d1b46ff3482d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2023-11-08 22:20:10--  https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1\n",
      "Resolving www.dropbox.com (www.dropbox.com)... 2620:100:6057:18::a27d:d12, 162.125.13.18\n",
      "Connecting to www.dropbox.com (www.dropbox.com)|2620:100:6057:18::a27d:d12|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1# [following]\n",
      "--2023-11-08 22:20:11--  https://uc63170224c66fda29da619e304b.dl.dropboxusercontent.com/cd/0/inline/CHOj1FEf2Dd6npmREaKmwUEIJ4S5QcrgeISKh55BE27i9tqrcE94Oym_0_z0EL9mBTmF9udNCxWwnFSHlio3ib6G_f_j3xiUzn5AVvQsKDPROYjazkJz_ChUVv3xkT-Pzuk/file?dl=1\n",
      "Resolving uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)... 2620:100:6057:15::a27d:d0f, 162.125.13.15\n",
      "Connecting to uc63170224c66fda29da619e304b.dl.dropboxusercontent.com (uc63170224c66fda29da619e304b.dl.dropboxusercontent.com)|2620:100:6057:15::a27d:d0f|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 60656 (59K) [application/binary]\n",
      "Saving to: ‘data/llama2_eval_qr_dataset.json’\n",
      "\n",
      "data/llama2_eval_qr 100%[===================>]  59.23K  --.-KB/s    in 0.02s   \n",
      "\n",
      "2023-11-08 22:20:12 (2.87 MB/s) - ‘data/llama2_eval_qr_dataset.json’ saved [60656/60656]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget \"https://www.dropbox.com/scl/fi/fh9vsmmm8vu0j50l3ss38/llama2_eval_qr_dataset.json?rlkey=kkoaez7aqeb4z25gzc06ak6kb&dl=1\" -O data/llama2_eval_qr_dataset.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a256715-0b45-42c3-9218-3ca04fe07f4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.evaluation import QueryResponseDataset\n",
    "\n",
    "# optional\n",
    "eval_dataset = QueryResponseDataset.from_json(\n",
    "    \"data/llama2_eval_qr_dataset.json\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c5b5644-a668-4c20-af4b-cbffd8c8c6fe",
   "metadata": {},
   "source": [
    "#### Option 2: Generate New Dataset\n",
    "\n",
    "If you choose this option, you can choose to generate a new dataset from scratch. This allows you to play around with our `DatasetGenerator` settings to make sure it suits your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a50f429-5b8e-476d-8031-a9a6dfe5cd53",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.evaluation import (\n",
    "    DatasetGenerator,\n",
    "    QueryResponseDataset,\n",
    ")\n",
    "from llama_index import ServiceContext\n",
    "from llama_index.llms import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e09743-1781-4959-95b5-6a71d8ef676d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: run this if the dataset isn't already saved\n",
    "# Note: we only generate from the first 20 nodes, since the rest are references\n",
    "eval_service_context = ServiceContext.from_defaults(\n",
    "    llm=OpenAI(model=\"gpt-4-1106-preview\")\n",
    ")\n",
    "dataset_generator = DatasetGenerator(\n",
    "    nodes[:20],\n",
    "    service_context=eval_service_context,\n",
    "    show_progress=True,\n",
    "    num_questions_per_chunk=3,\n",
    ")\n",
    "eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)\n",
    "eval_dataset.save_json(\"data/llama2_eval_qr_dataset.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63154cc9-4536-4fb4-a5ab-30bd8966e46d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional\n",
    "eval_dataset = QueryResponseDataset.from_json(\n",
    "    \"data/llama2_eval_qr_dataset.json\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "617e4c39-2a92-455f-982d-fd565309d9e9",
   "metadata": {},
   "source": [
    "### Eval Modules\n",
    "\n",
    "We define two evaluation modules: correctness and semantic similarity - both comparing quality of predicted response with actual response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d6bfcc5-2ba9-4df9-9154-d0a544c0b4c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.evaluation.eval_utils import get_responses, get_results_df\n",
    "from llama_index.evaluation import (\n",
    "    CorrectnessEvaluator,\n",
    "    SemanticSimilarityEvaluator,\n",
    "    BatchEvalRunner,\n",
    ")\n",
    "from llama_index.llms import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdda45f2-f97f-4083-a529-c0b924b82545",
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_llm = OpenAI(model=\"gpt-4-1106-preview\")\n",
    "eval_service_context = ServiceContext.from_defaults(llm=eval_llm)\n",
    "evaluator_c = CorrectnessEvaluator(service_context=eval_service_context)\n",
    "evaluator_s = SemanticSimilarityEvaluator(service_context=eval_service_context)\n",
    "evaluator_dict = {\n",
    "    \"correctness\": evaluator_c,\n",
    "    \"semantic_similarity\": evaluator_s,\n",
    "}\n",
    "batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3afde820-e018-41a8-ae19-a86732f404b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time\n",
    "import os\n",
    "import pickle\n",
    "from tqdm import tqdm\n",
    "\n",
    "\n",
    "def get_responses_sync(\n",
    "    eval_qs, query_engine, show_progress=True, save_path=None\n",
    "):\n",
    "    if show_progress:\n",
    "        eval_qs_iter = tqdm(eval_qs)\n",
    "    else:\n",
    "        eval_qs_iter = eval_qs\n",
    "    pred_responses = []\n",
    "    start_time = time.time()\n",
    "    for eval_q in eval_qs_iter:\n",
    "        print(f\"eval q: {eval_q}\")\n",
    "        pred_response = agent.query(eval_q)\n",
    "        print(f\"predicted response: {pred_response}\")\n",
    "        pred_responses.append(pred_response)\n",
    "        if save_path is not None:\n",
    "            # save intermediate responses (to cache in case something breaks)\n",
    "            avg_time = (time.time() - start_time) / len(pred_responses)\n",
    "            pickle.dump(\n",
    "                {\"pred_responses\": pred_responses, \"avg_time\": avg_time},\n",
    "                open(save_path, \"wb\"),\n",
    "            )\n",
    "    return pred_responses\n",
    "\n",
    "\n",
    "async def run_evals(\n",
    "    query_engine,\n",
    "    eval_qa_pairs,\n",
    "    batch_runner,\n",
    "    disable_async_for_preds=False,\n",
    "    save_path=None,\n",
    "):\n",
    "    # then evaluate\n",
    "    # TODO: evaluate a sample of generated results\n",
    "    eval_qs = [q for q, _ in eval_qa_pairs]\n",
    "    eval_answers = [a for _, a in eval_qa_pairs]\n",
    "\n",
    "    if save_path is not None:\n",
    "        if not os.path.exists(save_path):\n",
    "            start_time = time.time()\n",
    "            if disable_async_for_preds:\n",
    "                pred_responses = get_responses_sync(\n",
    "                    eval_qs,\n",
    "                    query_engine,\n",
    "                    show_progress=True,\n",
    "                    save_path=save_path,\n",
    "                )\n",
    "            else:\n",
    "                pred_responses = get_responses(\n",
    "                    eval_qs, query_engine, show_progress=True\n",
    "                )\n",
    "            avg_time = (time.time() - start_time) / len(eval_qs)\n",
    "            pickle.dump(\n",
    "                {\"pred_responses\": pred_responses, \"avg_time\": avg_time},\n",
    "                open(save_path, \"wb\"),\n",
    "            )\n",
    "        else:\n",
    "            # [optional] load\n",
    "            pickled_dict = pickle.load(open(save_path, \"rb\"))\n",
    "            pred_responses = pickled_dict[\"pred_responses\"]\n",
    "            avg_time = pickled_dict[\"avg_time\"]\n",
    "    else:\n",
    "        start_time = time.time()\n",
    "        pred_responses = get_responses(\n",
    "            eval_qs, query_engine, show_progress=True\n",
    "        )\n",
    "        avg_time = (time.time() - start_time) / len(eval_qs)\n",
    "\n",
    "    eval_results = await batch_runner.aevaluate_responses(\n",
    "        eval_qs, responses=pred_responses, reference=eval_answers\n",
    "    )\n",
    "    return eval_results, {\"avg_time\": avg_time}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0783a8db-5546-472a-8376-6d2774dba45a",
   "metadata": {},
   "source": [
    "## Construct Assistant with Built-In Retrieval\n",
    "\n",
    "Let's construct the assistant by also passing it the built-in OpenAI Retrieval tool.\n",
    "\n",
    "Here, we upload and pass in the file during assistant-creation time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ac4421f-ca9e-4d9f-91e1-10e1fb1119e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.agent import OpenAIAssistantAgent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "304c5c23-930c-4aed-8e0b-84a6e5c36138",
   "metadata": {},
   "outputs": [],
   "source": [
    "agent = OpenAIAssistantAgent.from_new(\n",
    "    name=\"SEC Analyst\",\n",
    "    instructions=\"You are a QA assistant designed to analyze sec filings.\",\n",
    "    openai_tools=[{\"type\": \"retrieval\"}],\n",
    "    instructions_prefix=\"Please address the user as Jerry.\",\n",
    "    files=[\"data/llama2.pdf\"],\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6845f8e9-ca2c-4bd1-a31a-dc58b47c585a",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = agent.query(\n",
    "    \"What are the key differences between Llama 2 and Llama 2-Chat?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1a93a9d-cc97-49c4-8a12-de710edc2fac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The key differences between Llama 2 and Llama 2-Chat, as indicated by the document, focus on their performance in safety evaluations, particularly when tested with adversarial prompts. Here are some of the differences highlighted within the safety evaluation section of Llama 2-Chat:\n",
      "\n",
      "1. Safety Human Evaluation: Llama 2-Chat was assessed with roughly 2000 adversarial prompts, among which 1351 were single-turn and 623 were multi-turn. The responses were judged for safety violations on a five-point Likert scale, where a rating of 1 or 2 indicated a violation. The evaluation aimed to gauge the model’s safety by its rate of generating responses with safety violations and its helpfulness to users.\n",
      "\n",
      "2. Violation Percentage and Mean Rating: Llama 2-Chat exhibited a low overall violation percentage across different model sizes and a high mean rating for safety and helpfulness, which suggests a strong performance in safety evaluations.\n",
      "\n",
      "3. Inter-Rater Reliability: The reliability of the safety assessments was measured using Gwet’s AC1/2 statistic, showing a high degree of agreement among annotators with an average inter-rater reliability score of 0.92 for Llama 2-Chat annotations.\n",
      "\n",
      "4. Single-turn and Multi-turn Conversations: The evaluation revealed that multi-turn conversations generally lead to more safety violations across models, but Llama 2-Chat performed well compared to baselines, particularly in multi-turn scenarios.\n",
      "\n",
      "5. Violation Percentage Per Risk Category: Llama 2-Chat had a relatively higher number of violations in the unqualified advice category, possibly due to a lack of appropriate disclaimers in its responses.\n",
      "\n",
      "6. Improvements in Fine-Tuned Llama 2-Chat: The document also mentions that the fine-tuned Llama 2-Chat showed significant improvement over the pre-trained Llama 2 in terms of truthfulness and toxicity. The percentage of toxic generations dropped to effectively 0% for Llama 2-Chat of all sizes, which was the lowest among all compared models, indicating a notable enhancement in safety.\n",
      "\n",
      "These points detail the evaluations and improvements emphasizing safety that distinguish Llama 2-Chat from Llama 2【9†source】.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b05e4116-1ed8-4b26-a506-8357fb4409b6",
   "metadata": {},
   "source": [
    "## Benchmark\n",
    "\n",
    "We run the agent over our evaluation dataset. We benchmark against a standard top-k RAG pipeline (k=2) with gpt-4-turbo.\n",
    "\n",
    "**NOTE**: During our time of testing (November 2023), the Assistant API is heavily rate-limited, and can take ~1-2 hours to generate responses over 60 datapoints."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82aa9da3-5963-4809-abd1-1d0c51716250",
   "metadata": {},
   "source": [
    "#### Define Baseline Index + RAG Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec6c6c62-3922-40a8-81d7-5790f111107a",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_sc = ServiceContext.from_defaults(llm=OpenAI(model=\"gpt-4-1106-preview\"))\n",
    "base_index = VectorStoreIndex(nodes, service_context=base_sc)\n",
    "base_query_engine = base_index.as_query_engine(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7a01cd9-a956-403a-9bf7-134dcf4ec940",
   "metadata": {},
   "source": [
    "#### Run Evals over Baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ede929ea-a722-467f-853e-a996af19170d",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_eval_results, base_extra_info = await run_evals(\n",
    "    base_query_engine,\n",
    "    eval_dataset.qr_pairs,\n",
    "    batch_runner,\n",
    "    save_path=\"data/llama2_preds_base.pkl\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38c33d1c-ecf8-4a9a-a273-e785535f9c15",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>names</th>\n",
       "      <th>correctness</th>\n",
       "      <th>semantic_similarity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Base Query Engine</td>\n",
       "      <td>4.05</td>\n",
       "      <td>0.964245</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               names  correctness  semantic_similarity\n",
       "0  Base Query Engine         4.05             0.964245"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "results_df = get_results_df(\n",
    "    [base_eval_results],\n",
    "    [\"Base Query Engine\"],\n",
    "    [\"correctness\", \"semantic_similarity\"],\n",
    ")\n",
    "display(results_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "822d89be-3a39-4e9f-9c4e-0893e438d31b",
   "metadata": {},
   "source": [
    "#### Run Evals over Assistant API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18fdc484-952e-4935-a241-b447c1440869",
   "metadata": {},
   "outputs": [],
   "source": [
    "assistant_eval_results, assistant_extra_info = await run_evals(\n",
    "    agent,\n",
    "    eval_dataset.qr_pairs[:55],\n",
    "    batch_runner,\n",
    "    save_path=\"data/llama2_preds_assistant.pkl\",\n",
    "    disable_async_for_preds=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb94a23e-0a9f-40d4-94f9-13dac3ff0b82",
   "metadata": {},
   "source": [
    "#### Get Results\n",
    "\n",
    "Here we see...that our basic RAG pipeline does better.\n",
    "\n",
    "Take these numbers with a grain of salt. The goal here is to give you a script so you can run this on your own data.\n",
    "\n",
    "That said it's surprising the Retrieval API doesn't give immediately better out of the box performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d7c79bb-7f29-4c8a-9a39-53621c379d70",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>names</th>\n",
       "      <th>correctness</th>\n",
       "      <th>semantic_similarity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Retrieval API</td>\n",
       "      <td>3.536364</td>\n",
       "      <td>0.952647</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Base Query Engine</td>\n",
       "      <td>4.050000</td>\n",
       "      <td>0.964245</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               names  correctness  semantic_similarity\n",
       "0      Retrieval API     3.536364             0.952647\n",
       "1  Base Query Engine     4.050000             0.964245"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Base Avg Time: 0.25683316787083943\n",
      "Assistant Avg Time: 75.43605598536405\n"
     ]
    }
   ],
   "source": [
    "results_df = get_results_df(\n",
    "    [assistant_eval_results, base_eval_results],\n",
    "    [\"Retrieval API\", \"Base Query Engine\"],\n",
    "    [\"correctness\", \"semantic_similarity\"],\n",
    ")\n",
    "display(results_df)\n",
    "print(f\"Base Avg Time: {base_extra_info['avg_time']}\")\n",
    "print(f\"Assistant Avg Time: {assistant_extra_info['avg_time']}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
